Daysim Comparison Analysis - Key Findings

Published

December 12, 2025

Methodology Note: All comparisons filter to:

  1. households present in both pipelines, and
  2. persons who report the same days (Tue/Wed/Thu) in both pipelines.

This excludes legacy’s artificially reassigned persons who traveled Monday but were assigned to weekdays for the “wt-wkday_3day” weighting scheme.


Summary of Differences

1. Time Encoding Difference [EXPLAINED BY SPECIFICATION CHANGE]

Status: Different time encoding between pipelines

Description:

  • Legacy uses HHMM format (e.g., 1244 for 12:44 PM)
  • New pipeline uses minutes since midnight (e.g., 764 for 12:44 PM = 12×60+44) per DaySim specification

The new pipeline follows the current DaySim standard for time encoding.
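The two encodings round-trip cleanly. A minimal pair of helpers (the comparison code later defines a similar `_minutes_to_hhmm` for matching):

```python
def hhmm_to_minutes(hhmm: int) -> int:
    """HHMM clock integer (legacy) -> minutes since midnight (new)."""
    return (hhmm // 100) * 60 + (hhmm % 100)

def minutes_to_hhmm(minutes: int) -> int:
    """Minutes since midnight (new) -> HHMM clock integer (legacy)."""
    return (minutes // 60) * 100 + (minutes % 60)

assert hhmm_to_minutes(1244) == 764   # 12:44 PM: legacy -> new encoding
assert minutes_to_hhmm(764) == 1244   # 12:44 PM: new -> legacy encoding
```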


2. Person-Days Difference [EXPLAINED BY MONDAY REASSIGNMENT]

Status: Explained - New pipeline includes all actual travel days; legacy reassigns Monday travelers

Code
# Filter to common days
legacy_days = legacy["personday"].filter(pl.col("day").is_in(DAYS_TO_COMPARE))
new_days = new["personday"].filter(pl.col("day").is_in(DAYS_TO_COMPARE))

# Totals
print(f"**Person-days (days 2-3-4, common households):**")
print(f"- Legacy: {len(legacy_days):,}")
print(f"- New: {len(new_days):,}")
print(f"- Difference: {len(new_days) - len(legacy_days):+,} ({(len(new_days) - len(legacy_days))/len(legacy_days)*100:+.1f}%)")
print()

# Unique persons
legacy_persons = legacy_days.select(["hhno", "pno"]).unique()
new_persons = new_days.select(["hhno", "pno"]).unique()

print(f"**Unique persons with days 2-4:**")
print(f"- Legacy: {len(legacy_persons):,}")
print(f"- New: {len(new_persons):,}")
print(f"- Difference: {len(new_persons) - len(legacy_persons):+,}")
print()

# Calculate average days per person
avg_days_legacy = len(legacy_days) / len(legacy_persons)
avg_days_new = len(new_days) / len(new_persons)
print(f"**Average person-days per person:**")
print(f"- Legacy: {avg_days_legacy:.2f} days/person")
print(f"- New: {avg_days_new:.2f} days/person")
**Person-days (days 2-3-4, common households):**
- Legacy: 32,271
- New: 36,921
- Difference: +4,650 (+14.4%)

**Unique persons with days 2-4:**
- Legacy: 12,307
- New: 12,307
- Difference: +0

**Average person-days per person:**
- Legacy: 2.62 days/person
- New: 3.00 days/person

Methodological difference identified: The legacy pipeline’s 03b-assign_day step creates person-day records based on day completion flags (tue_complete, wed_complete, thu_complete) rather than actual travel days:

  • A person who traveled on Monday (day 1) but marked multiple weekdays as complete gets multiple person-day records (one for each completed day)
  • The new pipeline correctly uses the actual travel_dow (day of week when travel occurred)
  • Result: Same unique persons, but legacy has fewer total person-days because it’s reassigning Monday travelers to weekdays for the “wt-wkday_3day” weighting scheme
  • Legacy averages 2.62 days/person vs New’s 3.00 days/person - the new pipeline includes more actual travel days

The legacy “wt-wkday_3day” is a modeling construct for 3-day weekday weighting, not actual survey days. The new pipeline correctly represents actual travel behavior.


3. Tour Matching Analysis [SUBSTANTIAL AGREEMENT]

Status: High tour-level agreement between pipelines

Note: Tours are filtered to common households and common persons who report days 2-4 in both pipelines. This excludes legacy’s artificially reassigned persons who traveled Monday.

Matching Methodology: Tours are matched between pipelines using household ID, person ID, day, and three tour times: departure from origin (tlvorig), arrival at primary destination (tardest), and arrival back at origin (tarorig). Since legacy uses HHMM time format and new uses minutes since midnight, new pipeline times are converted to HHMM for matching. A tour matches if all six fields are identical between pipelines.

Code
def _minutes_to_hhmm(minutes: int) -> int:
    """Convert minutes since midnight to HHMM format."""
    if minutes < 0:
        return -1
    hours = minutes // 60
    mins = minutes % 60
    return hours * 100 + mins

# NOTE: Tours are already filtered to common households and common persons
# (persons who report days 2-4 in both pipelines) from the setup section above.
# This excludes the 2,893 persons who were artificially reassigned by legacy.

# Filter to the days we're comparing
legacy_tours = legacy["tour"].filter(pl.col("day").is_in(DAYS_TO_COMPARE))
new_tours = new["tour"].filter(pl.col("day").is_in(DAYS_TO_COMPARE))

legacy_tour_count = len(legacy_tours)
new_tour_count = len(new_tours)

# Convert new pipeline times to HHMM for matching
new_tours_with_hhmm = new_tours.with_columns([
    pl.col("tlvorig").map_elements(_minutes_to_hhmm, return_dtype=pl.Int64).alias("tlvorig_hhmm"),
    pl.col("tardest").map_elements(_minutes_to_hhmm, return_dtype=pl.Int64).alias("tardest_hhmm"),
    pl.col("tarorig").map_elements(_minutes_to_hhmm, return_dtype=pl.Int64).alias("tarorig_hhmm"),
])

leg_tours_with_hhmm = legacy_tours.with_columns([
    pl.col("tlvorig").alias("tlvorig_hhmm"),
    pl.col("tardest").alias("tardest_hhmm"),
    pl.col("tarorig").alias("tarorig_hhmm"),
])

# Match on household, person, day, and tour times
MATCH_COLS = ["hhno", "pno", "day", "tlvorig_hhmm", "tardest_hhmm", "tarorig_hhmm"]

leg_tours_match = leg_tours_with_hhmm.select(
    MATCH_COLS + ["pdpurp", "tmodetp"]
).rename({"pdpurp": "pdpurp_leg", "tmodetp": "tmodetp_leg"})

new_tours_match = new_tours_with_hhmm.select(
    MATCH_COLS + ["pdpurp", "tmodetp"]
).rename({"pdpurp": "pdpurp_new", "tmodetp": "tmodetp_new"})

# Inner join to get matched tours
matched = leg_tours_match.join(new_tours_match, on=MATCH_COLS, how="inner")
match_rate = len(matched) / legacy_tour_count * 100

print(f"**Tour Counts:**")
print(f"- Legacy: {legacy_tour_count:,} tours")
print(f"- New: {new_tour_count:,} tours")
print(f"- Matched (by hhno, pno, day, times): {len(matched):,} ({match_rate:.1f}%)")
print()

# Analyze matched tours - individual tour-level comparison
purpose_match = matched.filter(pl.col("pdpurp_leg") == pl.col("pdpurp_new"))
mode_match = matched.filter(pl.col("tmodetp_leg") == pl.col("tmodetp_new"))
both_match = matched.filter(
    (pl.col("pdpurp_leg") == pl.col("pdpurp_new"))
    & (pl.col("tmodetp_leg") == pl.col("tmodetp_new"))
)

print(f"**Matched Tour Attribute Consistency (tour-level):**")
print(f"- Purpose matches: {len(purpose_match):,} ({len(purpose_match)/len(matched)*100:.1f}%)")
print(f"- Mode matches: {len(mode_match):,} ({len(mode_match)/len(matched)*100:.1f}%)")
print(f"- Both match: {len(both_match):,} ({len(both_match)/len(matched)*100:.1f}%)")
**Tour Counts:**
- Legacy: 39,587 tours
- New: 39,525 tours
- Matched (by hhno, pno, day, times): 33,177 (83.8%)

**Matched Tour Attribute Consistency (tour-level):**
- Purpose matches: 32,534 (98.1%)
- Mode matches: 32,616 (98.3%)
- Both match: 31,983 (96.4%)

Key Finding: 83.8% of tours match between pipelines when comparing household, person, day, departure time, destination arrival time, and return time. Among matched tours, 98.1% have matching purpose and 98.3% have matching mode at the individual tour level.

Unmatched Tours: The remaining 16.2% of tours likely reflect differences in tour extraction logic for complex travel patterns, multi-stop sequences, or edge cases in tour boundary detection:

  • Buffer (new) vs TAZ-based (legacy) destination detection
  • Handling of loop-trips in new pipeline (specifically home-based loop trips)
  • Orphaned linked trips in new pipeline (excessive dwell times between trips causing tour breaks)

Code
# Analyze unmatched tours
unmatched_legacy = leg_tours_match.join(new_tours_match, on=MATCH_COLS, how="anti")
unmatched_new = new_tours_match.join(leg_tours_match, on=MATCH_COLS, how="anti")

print(f"**Unmatched Details:**")
print(f"- Unmatched legacy: {len(unmatched_legacy):,} tours")
print(f"- Unmatched new: {len(unmatched_new):,} tours")
print()

# Check for near-matches (matching without return time)
MATCH_NO_RETURN = ["hhno", "pno", "day", "tlvorig_hhmm", "tardest_hhmm"]
leg_no_return = leg_tours_match.select(MATCH_NO_RETURN + ["tarorig_hhmm"]).rename({"tarorig_hhmm": "tarorig_leg"})
new_no_return = new_tours_match.select(MATCH_NO_RETURN + ["tarorig_hhmm"]).rename({"tarorig_hhmm": "tarorig_new"})
matched_no_return = leg_no_return.join(new_no_return, on=MATCH_NO_RETURN, how="inner")

additional_matches = len(matched_no_return) - len(matched)

print(f"**Near-Matches (same departure/arrival, different return):** {additional_matches:,} tours")
print(f"**Fundamental differences (different departure or arrival):** {len(unmatched_legacy) - additional_matches:,} tours")
**Unmatched Details:**
- Unmatched legacy: 6,410 tours
- Unmatched new: 6,348 tours

**Near-Matches (same departure/arrival, different return):** 104 tours
**Fundamental differences (different departure or arrival):** 6,306 tours

These unmatched tours suggest differences in tour extraction algorithms for complex travel patterns.


4. Trip Count Difference [CONSISTENT WITH TOURS]

Note: Trips are filtered to common households and common persons who report days 2-4 in both pipelines. This excludes legacy’s artificially reassigned persons who traveled Monday.

Code
legacy_trips = legacy["trip"].filter(
    pl.col("hhno").is_in(common_hhnos) & pl.col("day").is_in(DAYS_TO_COMPARE)
)
new_trips = new["trip"].filter(
    pl.col("hhno").is_in(common_hhnos) & pl.col("day").is_in(DAYS_TO_COMPARE)
)

print(f"Legacy: {len(legacy_trips):,} trips | New: {len(new_trips):,} trips | Difference: {len(new_trips) - len(legacy_trips):+,}")
Legacy: 114,131 trips | New: 112,934 trips | Difference: -1,197

Trip count differences are consistent with tour differences, suggesting different tour extraction logic rather than missing trips.


Data Quality Observations

5. Tour Purpose Distribution Differences [POSSIBLE CODING DIFFERENCE]

Status: Different purpose code mapping between pipelines

Note: This comparison is filtered to the common households and persons (those with matching days 2-4 in both pipelines), consistent with the tour and trip comparisons above.

Legacy uses codes 10 & 11:

  • Purpose 10 = Change Mode
  • Purpose 11 = Other/Overnight
  • May reflect original survey coding scheme

New pipeline uses codes 8 & 9:

  • Purpose 8 = Change Mode (standard DaySim)
  • Purpose 9 = Other (standard DaySim)
  • Follows current DaySim specification

Code
# Purpose names (DaySim codes)
purpose_names = {
    0: "Home",
    1: "Work",
    2: "School",
    3: "Escort",
    4: "Personal Business",
    5: "Shopping",
    6: "Meal",
    7: "Social/Recreation",
    8: "Change Mode",
    9: "Other",
    10: "Change Mode",  # Legacy code
    11: "Other",  # Legacy code
}

# Map legacy codes to semantic categories for comparison
# Legacy uses 10/11, new uses 8/9 for the same purposes
legacy_to_semantic = {
    0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7,
    10: 8,  # Legacy change mode (10) -> semantic change mode (8)
    11: 9,  # Legacy other (11) -> semantic other (9)
}
new_to_semantic = {i: i for i in range(10)}  # New codes are already semantic

# Get distributions with semantic mapping
legacy_dist = (
    legacy_tours.with_columns(
        pl.col("pdpurp").replace_strict(legacy_to_semantic, default=pl.col("pdpurp")).alias("semantic_purp")
    )
    .group_by("semantic_purp")
    .agg(pl.len().alias("legacy_count"))
    .with_columns([
        pl.col("legacy_count").cast(pl.Int64),
        (pl.col("legacy_count") / pl.col("legacy_count").sum() * 100).alias("legacy_pct"),
    ])
    .rename({"semantic_purp": "pdpurp"})
    .sort("pdpurp")
)

new_dist = (
    new_tours.group_by("pdpurp")
    .agg(pl.len().alias("new_count"))
    .with_columns([
        pl.col("new_count").cast(pl.Int64),
        (pl.col("new_count") / pl.col("new_count").sum() * 100).alias("new_pct"),
    ])
    .sort("pdpurp")
)

# Join distributions
comparison = (
    legacy_dist.join(new_dist, on="pdpurp", how="full", coalesce=True)
    .with_columns([
        pl.col("legacy_count").fill_null(0),
        pl.col("new_count").fill_null(0),
        pl.col("legacy_pct").fill_null(0.0),
        pl.col("new_pct").fill_null(0.0),
    ])
    .with_columns([
        (pl.col("new_count") - pl.col("legacy_count")).alias("change"),
        pl.when(pl.col("legacy_count") == 0)
        .then(None)
        .otherwise(((pl.col("new_count") - pl.col("legacy_count")).cast(pl.Float64) / pl.col("legacy_count").cast(pl.Float64)) * 100)
        .alias("pct_change"),
        pl.col("pdpurp")
        .map_elements(lambda x: purpose_names.get(x, f"Unknown({x})"), return_dtype=pl.Utf8)
        .alias("purpose_name"),
    ])
    .select(["pdpurp", "purpose_name", "legacy_count", "new_count", "change", "pct_change"])
    .sort("pdpurp")
)

from IPython.display import Markdown, display
display(Markdown(comparison.to_pandas().to_markdown(index=False, floatfmt=".1f")))
| pdpurp | purpose_name | legacy_count | new_count | change | pct_change |
|-------:|:------------------|-------------:|----------:|-------:|-----------:|
| 1 | Work | 6808 | 6340 | -468 | -6.9 |
| 2 | School | 1245 | 1826 | 581 | 46.7 |
| 3 | Escort | 6776 | 6898 | 122 | 1.8 |
| 4 | Personal Business | 6677 | 7100 | 423 | 6.3 |
| 5 | Shopping | 4039 | 4236 | 197 | 4.9 |
| 6 | Meal | 3826 | 3313 | -513 | -13.4 |
| 7 | Social/Recreation | 9055 | 8992 | -63 | -0.7 |
| 9 | Other | 1161 | 820 | -341 | -29.4 |

Analysis:

Purpose code mapping differences:

  • Survey code 11 (change mode): Legacy maps to DaySim code 10 (02a-reformat.py#L503), New maps to DaySim code 8 (mappings.py#L169)
  • Survey code 12/13 (overnight/other): Legacy maps to DaySim code 11, New maps to DaySim code 9
  • The comparison above normalizes these codes to compare semantically equivalent purposes (legacy 10→8, legacy 11→9)

Both pipelines handle these purposes correctly; the difference is in which DaySim integer codes they use. The new pipeline follows current DaySim standards (8=change mode, 9=other).

Change mode tours: Change mode should not exist as a tour purpose - it’s an intermediate activity that the trip linker should merge with adjacent trips. The presence of change mode tours indicates the linker failed to connect these trips, typically because:

  • Time gap between trips exceeds 120 minutes (link.py#L86), OR
  • Distance between trip endpoints exceeds 100 meters (link.py#L87)

These tours represent data quality issues (missing trips, incorrect timing/geocoding) or edge cases where someone only traveled to a mode transfer location and back. The new pipeline now flags these as CHANGE_MODE in the tour data quality field (tours.py#L47, validation_helpers.py#L169) and filters them out during DaySim formatting (format_daysim.py#L79).
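The two linker thresholds can be reconstructed as a simple predicate. This is an illustrative sketch of the rule cited above (120-minute dwell, 100-meter endpoint gap); the function name and signature are ours, not the pipeline's actual API in link.py.

```python
# Thresholds as described for link.py (assumed names, illustrative only)
MAX_DWELL_MIN = 120   # maximum time gap between consecutive trips
MAX_GAP_M = 100.0     # maximum distance between trip endpoints

def can_link(prev_arrive_min: int, next_depart_min: int, endpoint_gap_m: float) -> bool:
    """True if two consecutive trips at a mode-transfer point can be linked."""
    dwell = next_depart_min - prev_arrive_min
    return dwell <= MAX_DWELL_MIN and endpoint_gap_m <= MAX_GAP_M

print(can_link(480, 510, 20.0))   # 30-min transfer, endpoints 20 m apart -> linkable
print(can_link(480, 650, 20.0))   # 170-min dwell -> tour break; change-mode "tour" remains
print(can_link(480, 510, 250.0))  # endpoints 250 m apart -> tour break
```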


6. Mode Distribution Differences [CLASSIFICATION LOGIC DIFFERENCES]

Status: Different mode inference logic between pipelines

Code
# Mode names
mode_names = {
    0: "Other",
    1: "Walk",
    2: "Bike",
    3: "SOV",
    4: "HOV2",
    5: "HOV3+",
    6: "Walk-Transit",
    7: "Drive-Transit",
    8: "School Bus",
    9: "TNC",
}

# Get distributions
legacy_mode_dist = (
    legacy_tours.group_by("tmodetp")
    .agg(pl.len().alias("legacy_count"))
    .with_columns(pl.col("legacy_count").cast(pl.Int64))
    .sort("tmodetp")
)

new_mode_dist = (
    new_tours.group_by("tmodetp")
    .agg(pl.len().alias("new_count"))
    .with_columns(pl.col("new_count").cast(pl.Int64))
    .sort("tmodetp")
)

# Join distributions
mode_comparison = (
    legacy_mode_dist.join(new_mode_dist, on="tmodetp", how="full", coalesce=True)
    .with_columns([
        pl.col("legacy_count").fill_null(0),
        pl.col("new_count").fill_null(0),
    ])
    .with_columns([
        (pl.col("new_count") - pl.col("legacy_count")).alias("change"),
        pl.when(pl.col("legacy_count") == 0)
        .then(0.0)
        .otherwise(((pl.col("new_count") - pl.col("legacy_count")).cast(pl.Float64) / pl.col("legacy_count").cast(pl.Float64)) * 100)
        .alias("pct_change"),
        pl.col("tmodetp")
        .map_elements(lambda x: mode_names.get(x, f"Unknown({x})"), return_dtype=pl.Utf8)
        .alias("mode_name"),
    ])
    .select(["tmodetp", "mode_name", "legacy_count", "new_count", "change", "pct_change"])
    .sort("tmodetp")
)

# Show notable differences
large_diffs = mode_comparison.filter(pl.col("pct_change").abs() > 30).sort(
    pl.col("change").abs(), descending=True
)
from IPython.display import Markdown, display
display(Markdown("**Modes with substantial differences (>30% change):**\n"))
display(Markdown(large_diffs.to_pandas().to_markdown(index=False, floatfmt=".1f")))

Modes with substantial differences (>30% change):

| tmodetp | mode_name | legacy_count | new_count | change | pct_change |
|--------:|:----------|-------------:|----------:|-------:|-----------:|
| 0 | Other | 220 | 678 | 458 | 208.2 |
| 9 | TNC | 678 | 351 | -327 | -48.2 |

Analysis: Mode distributions differ primarily in:

  • HOV3+ vs Other - Different vehicle occupancy or shuttle classification logic
  • Drive-to-transit - Different transit access/egress mode detection methods
  • SOV/HOV splits - Possible differences in occupancy calculation

Both pipelines use standard DaySim mode codes; differences reflect classification logic changes.
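One plausible source of the SOV/HOV split differences is how vehicle occupancy maps to DaySim auto mode codes. The thresholds below are an illustrative sketch, not either pipeline's actual classification logic; the code values match the `mode_names` table used in this comparison (3=SOV, 4=HOV2, 5=HOV3+).

```python
def occupancy_to_mode(travelers: int) -> int:
    """Map vehicle occupancy to a DaySim auto mode code (illustrative thresholds)."""
    if travelers <= 0:
        return 0  # Other / unknown occupancy
    if travelers == 1:
        return 3  # SOV
    if travelers == 2:
        return 4  # HOV2
    return 5      # HOV3+

print([occupancy_to_mode(n) for n in (0, 1, 2, 3, 5)])  # [0, 3, 4, 5, 5]
```

A pipeline that treats missing occupancy as Other (code 0) rather than imputing it would shift tours from the HOV categories into Other, consistent with the table above.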


7. Disaggregate Mode-Purpose Matrix [DETAILED COMPARISON]

Status: Cross-tabulation of tour mode by purpose for both pipelines

This section shows the detailed breakdown of tours by mode and purpose, highlighting where the pipelines differ.

Code
# Map legacy purposes to semantic categories for comparison
legacy_tours_semantic = legacy_tours.with_columns(
    pl.col("pdpurp").replace_strict(legacy_to_semantic, default=pl.col("pdpurp")).alias("semantic_purp")
)

# Create cross-tabulations
legacy_crosstab = (
    legacy_tours_semantic.group_by(["tmodetp", "semantic_purp"])
    .agg(pl.len().alias("count"))
    .pivot(values="count", index="tmodetp", on="semantic_purp", aggregate_function="sum")
    .fill_null(0)
    .sort("tmodetp")
)

new_crosstab = (
    new_tours.group_by(["tmodetp", "pdpurp"])
    .agg(pl.len().alias("count"))
    .pivot(values="count", index="tmodetp", on="pdpurp", aggregate_function="sum")
    .fill_null(0)
    .sort("tmodetp")
)

# Get all modes and purposes present in either pipeline
all_modes = sorted(set(legacy_crosstab["tmodetp"].to_list() + new_crosstab["tmodetp"].to_list()))
all_purposes = sorted(set(
    [c for c in legacy_crosstab.columns if c != "tmodetp"] +
    [c for c in new_crosstab.columns if c != "tmodetp"]
))

# Build full matrices with all combinations
def build_matrix(df, modes, purposes):
    matrix = np.zeros((len(modes), len(purposes)))
    for i, mode in enumerate(modes):
        row = df.filter(pl.col("tmodetp") == mode)
        if len(row) > 0:
            for j, purpose in enumerate(purposes):
                col_name = str(purpose)
                if col_name in df.columns:
                    matrix[i, j] = row[col_name][0]
    return matrix

legacy_matrix = build_matrix(legacy_crosstab, all_modes, all_purposes)
new_matrix = build_matrix(new_crosstab, all_modes, all_purposes)

# Calculate difference (new - legacy) and percent difference
diff_matrix = new_matrix - legacy_matrix
with np.errstate(divide="ignore", invalid="ignore"):  # np.where evaluates both branches
    pct_diff_matrix = np.where(
        legacy_matrix > 0,
        (diff_matrix / legacy_matrix) * 100,
        np.where(new_matrix > 0, 100, 0),  # If legacy=0 but new>0, show +100%
    )

# Mode and purpose labels (pivot column names are strings, so cast back to int)
mode_labels = [mode_names.get(m, f"Mode {m}") for m in all_modes]
purpose_labels = [purpose_names.get(int(p), f"Purpose {p}") for p in all_purposes]

# Print summary of largest differences
diffs_list = []
for i, mode in enumerate(all_modes):
    for j, purpose in enumerate(all_purposes):
        diffs_list.append({
            'mode': mode_names.get(mode, f"Mode {mode}"),
            'purpose': purpose_names.get(int(purpose), f"Purpose {purpose}"),
            'legacy': int(legacy_matrix[i, j]),
            'new': int(new_matrix[i, j]),
            'diff': int(diff_matrix[i, j]),
            'pct_diff': pct_diff_matrix[i, j] if not np.isnan(pct_diff_matrix[i, j]) else 0
        })

import pandas as pd
diffs_df = pd.DataFrame(diffs_list)
top_diffs = diffs_df.nlargest(10, 'diff', keep='all')
from IPython.display import Markdown, display
display(Markdown("\n**Top 10 Mode-Purpose Combinations with Largest Absolute Differences:**\n"))
display(Markdown(top_diffs.to_markdown(index=False, floatfmt=".1f")))

Top 10 Mode-Purpose Combinations with Largest Absolute Differences:

| mode | purpose | legacy | new | diff | pct_diff |
|:-------------|:------------------|-------:|-----:|-----:|---------:|
| HOV3+ | School | 327 | 653 | 326 | 99.7 |
| HOV2 | Personal Business | 898 | 1171 | 273 | 30.4 |
| HOV3+ | Escort | 2114 | 2335 | 221 | 10.5 |
| SOV | Personal Business | 3460 | 3636 | 176 | 5.1 |
| HOV2 | School | 247 | 416 | 169 | 68.4 |
| Other | Other | 24 | 170 | 146 | 608.3 |
| Other | Personal Business | 51 | 184 | 133 | 260.8 |
| Other | Work | 19 | 147 | 128 | 673.7 |
| HOV3+ | Social/Recreation | 440 | 551 | 111 | 25.2 |
| Walk-Transit | Work | 1124 | 1205 | 81 | 7.2 |

Tour Mode and Purpose Comparisons

Code
# Aggregate by mode (sum across all purposes)
legacy_by_mode = legacy_matrix.sum(axis=1)  # Sum across purposes for each mode
new_by_mode = new_matrix.sum(axis=1)

# Aggregate by purpose (sum across all modes)
legacy_by_purpose = legacy_matrix.sum(axis=0)  # Sum across modes for each purpose
new_by_purpose = new_matrix.sum(axis=0)

from scipy import stats

# Calculate correlations
r_mode = stats.pearsonr(legacy_by_mode, new_by_mode)[0]
r_purpose = stats.pearsonr(legacy_by_purpose, new_by_purpose)[0]

# Create subplots for mode and purpose scatter plots
fig_scatter = make_subplots(
    rows=1, cols=2,
    subplot_titles=(
        f'Mode Distribution Comparison<br><sub>R² = {r_mode**2:.4f}</sub>',
        f'Purpose Distribution Comparison<br><sub>R² = {r_purpose**2:.4f}</sub>'
    ),
    horizontal_spacing=0.12
)

# 1. Mode Scatter Plot
max_mode = max(legacy_by_mode.max(), new_by_mode.max())
fig_scatter.add_trace(
    go.Scatter(
        x=legacy_by_mode,
        y=new_by_mode,
        mode='markers',
        marker=dict(size=12, opacity=0.7, color='steelblue'),
        text=mode_labels,
        hovertemplate='<b>%{text}</b><br>Legacy: %{x:,.0f}<br>New: %{y:,.0f}<extra></extra>',
        name='Modes'
    ),
    row=1, col=1
)

fig_scatter.add_trace(
    go.Scatter(
        x=[0, max_mode],
        y=[0, max_mode],
        mode='lines',
        line=dict(color='red', dash='dash', width=1),
        name='Perfect Match',
        hoverinfo='skip',
        showlegend=False
    ),
    row=1, col=1
)

# 2. Purpose Scatter Plot
max_purpose = max(legacy_by_purpose.max(), new_by_purpose.max())
fig_scatter.add_trace(
    go.Scatter(
        x=legacy_by_purpose,
        y=new_by_purpose,
        mode='markers',
        marker=dict(size=12, opacity=0.7, color='darkgreen'),
        text=purpose_labels,
        hovertemplate='<b>%{text}</b><br>Legacy: %{x:,.0f}<br>New: %{y:,.0f}<extra></extra>',
        name='Purposes'
    ),
    row=1, col=2
)

fig_scatter.add_trace(
    go.Scatter(
        x=[0, max_purpose],
        y=[0, max_purpose],
        mode='lines',
        line=dict(color='red', dash='dash', width=1),
        name='Perfect Match',
        hoverinfo='skip',
        showlegend=False
    ),
    row=1, col=2
)

# Update axes
fig_scatter.update_xaxes(title_text='Legacy Tour Count', row=1, col=1)
fig_scatter.update_yaxes(title_text='New Tour Count', row=1, col=1)
fig_scatter.update_xaxes(title_text='Legacy Tour Count', row=1, col=2)
fig_scatter.update_yaxes(title_text='New Tour Count', row=1, col=2)

fig_scatter.update_layout(
    height=500,
    showlegend=False,
    template='plotly_white',
    hovermode='closest',
    margin=dict(t=80, b=50, l=50, r=50)
)

fig_scatter.show()

# 2. Sankey Diagrams: Mode and Purpose Flow
# Use matched tours to show how classifications flow from legacy to new

# Mode Sankey
mode_flow = (
    matched.group_by(['tmodetp_leg', 'tmodetp_new'])
    .agg(pl.len().alias('count'))
    .sort('count', descending=True)
)

# Create source/target lists for mode
mode_sources = []
mode_targets = []
mode_values = []
mode_labels = list(mode_names.values())
n_modes = len(mode_labels)

# Build node labels: Legacy modes first, then New modes
mode_node_labels = [f"Legacy: {label}" for label in mode_labels] + [f"New: {label}" for label in mode_labels]

for row in mode_flow.iter_rows(named=True):
    legacy_mode = row['tmodetp_leg']
    new_mode = row['tmodetp_new']
    count = row['count']

    # Find indices
    legacy_idx = list(mode_names.keys()).index(legacy_mode)
    new_idx = list(mode_names.keys()).index(new_mode) + n_modes

    mode_sources.append(legacy_idx)
    mode_targets.append(new_idx)
    mode_values.append(count)

# Purpose Sankey (with semantic mapping for legacy)
matched_purpose = matched.with_columns(
    pl.col("pdpurp_leg").replace_strict(legacy_to_semantic, default=pl.col("pdpurp_leg")).alias("pdpurp_leg_semantic")
)

purpose_flow = (
    matched_purpose.group_by(['pdpurp_leg_semantic', 'pdpurp_new'])
    .agg(pl.len().alias('count'))
    .sort('count', descending=True)
)

# Create source/target lists for purpose
purpose_sources = []
purpose_targets = []
purpose_values = []
purpose_labels_list = [purpose_names[k] for k in sorted([k for k in purpose_names.keys() if k < 10])]
n_purposes = len(purpose_labels_list)

# Build node labels: Legacy purposes first, then New purposes
purpose_node_labels = [f"Legacy: {label}" for label in purpose_labels_list] + [f"New: {label}" for label in purpose_labels_list]

purpose_keys = sorted([k for k in purpose_names.keys() if k < 10])

for row in purpose_flow.iter_rows(named=True):
    legacy_purpose = row['pdpurp_leg_semantic']
    new_purpose = row['pdpurp_new']
    count = row['count']

    try:
        legacy_idx = purpose_keys.index(legacy_purpose)
        new_idx = purpose_keys.index(new_purpose) + n_purposes

        purpose_sources.append(legacy_idx)
        purpose_targets.append(new_idx)
        purpose_values.append(count)
    except ValueError:
        continue

# Create Sankey diagrams
fig_sankey = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Mode Classification Flow', 'Purpose Classification Flow'),
    specs=[[{"type": "sankey"}, {"type": "sankey"}]],
    horizontal_spacing=0.05
)

# Mode Sankey
fig_sankey.add_trace(
    go.Sankey(
        node=dict(
            pad=15,
            thickness=20,
            line=dict(color="black", width=0.5),
            label=mode_node_labels,
        ),
        link=dict(
            source=mode_sources,
            target=mode_targets,
            value=mode_values,
        )
    ),
    row=1, col=1
)

# Purpose Sankey
fig_sankey.add_trace(
    go.Sankey(
        node=dict(
            pad=15,
            thickness=20,
            line=dict(color="black", width=0.5),
            label=purpose_node_labels,
        ),
        link=dict(
            source=purpose_sources,
            target=purpose_targets,
            value=purpose_values,
        )
    ),
    row=1, col=2
)

fig_sankey.update_layout(
    height=600,
    margin=dict(t=80, b=50, l=50, r=50)
)

fig_sankey.show()

# Calculate summary statistics
total_legacy = int(legacy_matrix.sum())
total_new = int(new_matrix.sum())
total_diff = total_new - total_legacy
mean_abs_diff = float(np.mean(np.abs(diff_matrix)))
median_abs_diff = float(np.median(np.abs(diff_matrix)))
max_abs_diff = float(np.max(np.abs(diff_matrix)))

# Calculate Total Variation Distance (TVD) - normalized similarity metric
legacy_dist = legacy_matrix.flatten() / legacy_matrix.sum()
new_dist = new_matrix.flatten() / new_matrix.sum()
tvd = 0.5 * np.sum(np.abs(legacy_dist - new_dist))

# Calculate cosine similarity
from numpy.linalg import norm
cosine_sim = np.dot(legacy_matrix.flatten(), new_matrix.flatten()) / (norm(legacy_matrix.flatten()) * norm(new_matrix.flatten()))

from IPython.display import Markdown, display
display(Markdown(f"""
**Summary Statistics:**

- **Total Tours:** Legacy = {total_legacy:,} | New = {total_new:,} | Diff = {total_diff:+,}
- **Mean Absolute Difference:** {mean_abs_diff:.1f} tours per mode-purpose combo
- **Median Absolute Difference:** {median_abs_diff:.1f} tours
- **Max Absolute Difference:** {max_abs_diff:.0f} tours
- **Mode Correlation (R):** {r_mode:.4f} (R² = {r_mode**2:.4f})
- **Purpose Correlation (R):** {r_purpose:.4f} (R² = {r_purpose**2:.4f})
- **Total Variation Distance:** {tvd:.4f} (0 = identical, 1 = completely different)
- **Cosine Similarity:** {cosine_sim:.4f} (1 = identical pattern, 0 = orthogonal)
"""))

Summary Statistics:

  • Total Tours: Legacy = 39,587 | New = 39,525 | Diff = -62
  • Mean Absolute Difference: 61.0 tours per mode-purpose combo
  • Median Absolute Difference: 22.5 tours
  • Max Absolute Difference: 424 tours
  • Mode Correlation (R): 0.9954 (R² = 0.9907)
  • Purpose Correlation (R): 0.9897 (R² = 0.9795)
  • Total Variation Distance: 0.0616 (0 = identical, 1 = completely different)
  • Cosine Similarity: 0.9943 (1 = identical pattern, 0 = orthogonal)

Analysis:

The visualizations reveal high consistency between pipelines:

  • Scatter plots: Strong correlations (R² ≈ 1.0) show mode and purpose distributions align well
  • Sankey diagrams: Thick straight-through flows show classification agreement; thin crossing flows highlight specific differences
  • Summary metrics: Low Total Variation Distance and high Cosine Similarity confirm overall pattern consistency
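The behavior of the two pattern metrics can be sanity-checked on toy distributions (illustrative numbers, not pipeline data):

```python
import numpy as np
from numpy.linalg import norm

# Two toy distributions differing by 10% of probability mass in one cell pair
a = np.array([0.5, 0.3, 0.2])
b = np.array([0.4, 0.4, 0.2])

tvd = 0.5 * np.sum(np.abs(a - b))         # half the L1 distance: ~0.10
cos = np.dot(a, b) / (norm(a) * norm(b))  # near 1 for similar overall shape

print(f"TVD={tvd:.2f}, cosine={cos:.4f}")
```

A TVD of 0.10 means 10% of probability mass would have to move to make the distributions identical, so the observed pipeline TVD of 0.0616 indicates only about 6% of the mode-purpose mass sits in different cells.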


Recommendations

Key Findings:

  1. High tour match rate with excellent attribute consistency (>97%)
  2. New pipeline correctly uses actual travel days; legacy artificially assigns days for weighting
  3. New pipeline follows current DaySim conventions for time encoding and purpose codes
  4. Both pipelines produce similar overall tour patterns

Further Investigation:

  1. Analyze unmatched tours to understand tour extraction differences
  2. Review mode classification logic for HOV3+, drive-to-transit, and Other categories


Summary

Pipeline Comparison Results:

  1. Tour Matching: High match rate by household, person, day, and times with >97% attribute consistency
  2. Time Encoding: Legacy uses HHMM format; new uses minutes since midnight (current DaySim spec)
  3. Purpose Codes: Legacy uses codes 10-11; new uses standard DaySim codes 8-9
  4. Person-Days: Legacy artificially assigns days based on completion flags for weighting purposes; new uses actual travel days
  5. Mode Classification: Differences in HOV3+, drive-to-transit, and other mode inference logic

Conclusion: The pipelines produce similar tour patterns. The new pipeline correctly represents actual travel days and follows current DaySim specifications. The legacy “wt-wkday_3day” dataset is a modeling construct that artificially reassigns travel days for weighting purposes.